1 Objectives

We, data scientists, work with data and endeavor to make sensible decisions based on data. We analyze data by first understanding the data and then extracting useful information to assist decision making.

One cannot be a doctor, let alone a good doctor, without rich knowledge in medicine and years of experience in clinical practice. The same applies to a data scientist: we need to accumulate hands-on experience in dealing with data. Deciding what data to collect and what we are looking for in the data is a very important first step toward every successful study. This requires you to equip yourself with a toolset, i.e., R packages/functions for manipulating, summarizing and visualizing data, which is the cornerstone of our journey toward becoming great data scientists.

Useful packages:

We will focus on four specific packages: dplyr, ggplot2 (referred to as ggplot below), purrr and data.table. dplyr, ggplot2, and purrr are part of a broader collection called tidyverse, which consolidates commonly-used data science packages. See more here: https://www.tidyverse.org/.

  • dplyr can be used for data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. Cheatsheet
  • ggplot is used for creating graphics and plots. Cheatsheet
  • purrr is a functional programming toolkit that helps you replace for loops to improve readability. Cheatsheet
  • data.table is an R package that provides an enhanced version of data.frames. Its speed makes it a preferable package to manipulate BIG data. Cheatsheet

In addition to tidyverse and data.table, the packages skimr, nycflights13, gapminder, ggthemes, ggpubr and plotly will be used in this tutorial. We use the p_load function of pacman to load all the packages at once. It first checks whether each package is installed, installs any missing packages automatically, and then loads them all.
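
For reference, the loading step might look like the following (a minimal sketch; the package list simply follows the description above):

```r
# install.packages("pacman")  # run once if pacman is not yet installed

# p_load() checks each package, installs any that are missing, then loads them all
pacman::p_load(tidyverse, data.table, skimr, nycflights13,
               gapminder, ggthemes, ggpubr, plotly)
```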

2 Using dplyr

Before we dig into the details of dplyr, let us first find out information about it. We may always want to Google first: r dplyr. Alternatively, most R packages provide vignettes that explain the package, its usage, etc. We may use vignette() or browseVignettes("package name") to get the vignette.

browseVignettes("dplyr") # web version
vignette("dplyr") # version inside Rstudio

There are several documents in the vignette for dplyr. Let us explore Introduction to dplyr, which includes dplyr.rmd, dplyr.html and dplyr.R. (The very first draft of the Advanced R Tutorial contains some materials from dplyr.rmd.)

We next highlight the main dplyr verbs.

Below is a table of popular dplyr commands. We will go through each one.

dplyr command   SQL equivalent   Action
filter()        WHERE            Limit rows based on a condition
select()        SELECT           Choose only certain variables
distinct()      DISTINCT         De-duplicate the result set
arrange()       ORDER BY         Order results
rename()        SELECT ... AS    Rename variables
mutate()        SELECT           Create new variables
group_by()      GROUP BY         Group rows
summarise()     SELECT           Create summary variables in a grouped setting
join()          JOIN             Join tables

You can also view the full cheat sheet here: DPLYR Cheat Sheet

To illustrate these commands, we will use the flights dataset in the nycflights13 package, which contains 336,776 flights that departed from New York City in 2013. dplyr allows you to gather insight from a dataset without altering the original data, and it is considered best practice not to alter the original dataset. In this case, we will never overwrite the existing dataset ‘flights’. We will first take a look at the dataset and its summary statistics.

library(nycflights13) # we already loaded nycflights13 in the first R chunk, so this call is redundant
names(flights) # find the variables
dim(flights) # size of the data
str(flights) # structure of the data
summary(flights) # may spot peculiar/unusual...

Notice that there are many missing values in the dataset, which raises a red flag about the data. Since our focus in this lecture is on various R packages, we will not get into how to deal with missing values at the moment. BUT IT IS A VERY IMPORTANT ISSUE we always need to address in data analyses.

We could use the package skimr to get a better-organized summary.

# summary(flights)
skim(flights) # reports n_missing and complete_rate for each variable

2.1 Pipes

The %>% command is called a pipe. This means that the result of the code before %>% is sent, or “piped”, to the one after %>%. Piping is a powerful tool for clearly expressing a sequence of multiple operations, as we will see shortly.

Here’s an example of using piping to first square 1:5 and then sum them up. The dot (.) refers to the output of the previous pipe. The following is equivalent to sum((1:5)^2).

1:5 %>% 
  .^2 %>% 
  sum
## [1] 55

2.2 Filter

The filter command returns only the rows of your dataset that match a given condition. The command below shows only flights on Jan 1, 2013.

The code below is equivalent to filter(flights, month == 1 & day == 1), since %>% passes the flights dataframe to the filter command. & is the AND operator; | is the OR operator.

flights %>%
  filter(month == 1 & day == 1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

It is important to remember that this command does not alter the original flights dataset. If we want to save this subset as its own object, we run the following. Remember, <- is the assignment operator in R.

filteredFlight <- flights %>%
  filter(month == 1 & (day == 1 | day == 2))
filteredFlight
## # A tibble: 1,785 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 1,775 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# View(filteredFlight)

Multiple conditions can be included in a filter command. The command below shows any flights from Jan through June to PHL or SLC airports.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6)
## # A tibble: 2,116 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      655            655         0     1021           1030
##  2  2013     1     1      908            915        -7     1004           1033
##  3  2013     1     1     1047           1050        -3     1401           1410
##  4  2013     1     1     1245           1245         0     1616           1615
##  5  2013     1     1     1323           1300        23     1651           1608
##  6  2013     1     1     1543           1550        -7     1933           1925
##  7  2013     1     1     1600           1610       -10     1712           1729
##  8  2013     1     1     1909           1912        -3     2239           2237
##  9  2013     1     1     1915           1920        -5     2238           2257
## 10  2013     1     1     2000           2000         0     2054           2110
## # … with 2,106 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.3 Select

select() returns only the columns that are listed. In this case, the resulting dataset will consist of the origin, destination, and carrier of flights that were destined for PHL or SLC in the first 6 months of the year. Remember, the pipe sends the result of the current line to the next line; here, the filtered dataset is piped into the select command.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier)
## # A tibble: 2,116 x 3
##    origin dest  carrier
##    <chr>  <chr> <chr>  
##  1 JFK    SLC   DL     
##  2 LGA    PHL   US     
##  3 JFK    SLC   DL     
##  4 JFK    SLC   DL     
##  5 EWR    SLC   DL     
##  6 JFK    SLC   DL     
##  7 JFK    PHL   9E     
##  8 JFK    SLC   B6     
##  9 JFK    SLC   DL     
## 10 JFK    PHL   9E     
## # … with 2,106 more rows

Conversely, we can use - to deselect columns. If we want to drop year, month and day, we just need to prefix each column name with -.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(-year, -month, -day)
## # A tibble: 2,116 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      655            655         0     1021           1030        -9 DL     
##  2      908            915        -7     1004           1033       -29 US     
##  3     1047           1050        -3     1401           1410        -9 DL     
##  4     1245           1245         0     1616           1615         1 DL     
##  5     1323           1300        23     1651           1608        43 DL     
##  6     1543           1550        -7     1933           1925         8 DL     
##  7     1600           1610       -10     1712           1729       -17 9E     
##  8     1909           1912        -3     2239           2237         2 B6     
##  9     1915           1920        -5     2238           2257       -19 DL     
## 10     2000           2000         0     2054           2110       -16 9E     
## # … with 2,106 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

2.4 Distinct

distinct() removes any duplicate rows from the given dataset. Notice that the previous command returned a subset with 2,116 rows, but with distinct, we can see that there were only 8 distinct origin-destination-carrier combinations for flights to PHL or SLC in the first half of the year.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier) %>% 
  distinct()
## # A tibble: 8 x 3
##   origin dest  carrier
##   <chr>  <chr> <chr>  
## 1 JFK    SLC   DL     
## 2 LGA    PHL   US     
## 3 EWR    SLC   DL     
## 4 JFK    PHL   9E     
## 5 JFK    SLC   B6     
## 6 EWR    PHL   EV     
## 7 JFK    PHL   US     
## 8 JFK    PHL   DL

2.5 Arrange

arrange() sorts your data by the given columns (alphabetically for character columns). In this case the rows are ordered first by origin, then by destination in descending alphabetical order, then by carrier in alphabetical order.

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier) %>% 
  distinct() %>%
  arrange(origin, desc(dest), carrier)
## # A tibble: 8 x 3
##   origin dest  carrier
##   <chr>  <chr> <chr>  
## 1 EWR    SLC   DL     
## 2 EWR    PHL   EV     
## 3 JFK    SLC   B6     
## 4 JFK    SLC   DL     
## 5 JFK    PHL   9E     
## 6 JFK    PHL   DL     
## 7 JFK    PHL   US     
## 8 LGA    PHL   US

2.6 Rename

The rename function can be used to easily rename a column header. Here, we rename carrier to airline. The syntax is rename(new_name = old_name).

flights %>%
  filter(dest %in% c("PHL","SLC") & month <= 6) %>%
  select(origin, dest, carrier) %>% 
  distinct() %>%
  arrange(origin, desc(dest), carrier) %>%
  rename(airline = carrier)
## # A tibble: 8 x 3
##   origin dest  airline
##   <chr>  <chr> <chr>  
## 1 EWR    SLC   DL     
## 2 EWR    PHL   EV     
## 3 JFK    SLC   B6     
## 4 JFK    SLC   DL     
## 5 JFK    PHL   9E     
## 6 JFK    PHL   DL     
## 7 JFK    PHL   US     
## 8 LGA    PHL   US

2.7 Mutate

mutate() is used to create new columns based on current ones, which is very useful. Here, we create three new variables gain, speed, and gain_per_hour. Notice how gain_per_hour uses the column gain, which was created in the same mutate statement.

flights %>%
  mutate(gain          = dep_delay - arr_delay, 
         speed         = distance / air_time * 60, 
         gain_per_hour = gain / (air_time / 60)) %>% 
  select(dep_delay, arr_delay, gain, distance, air_time, speed, gain_per_hour)
## # A tibble: 336,776 x 7
##    dep_delay arr_delay  gain distance air_time speed gain_per_hour
##        <dbl>     <dbl> <dbl>    <dbl>    <dbl> <dbl>         <dbl>
##  1         2        11    -9     1400      227  370.         -2.38
##  2         4        20   -16     1416      227  374.         -4.23
##  3         2        33   -31     1089      160  408.        -11.6 
##  4        -1       -18    17     1576      183  517.          5.57
##  5        -6       -25    19      762      116  394.          9.83
##  6        -4        12   -16      719      150  288.         -6.4 
##  7        -5        19   -24     1065      158  404.         -9.11
##  8        -3       -14    11      229       53  259.         12.5 
##  9        -3        -8     5      944      140  405.          2.14
## 10        -2         8   -10      733      138  319.         -4.35
## # … with 336,766 more rows

Use transmute if you only want to keep the new variables.

flights %>%
  transmute(gain          = dep_delay - arr_delay, 
            speed         = distance / air_time * 60, 
            gain_per_hour = gain / (air_time / 60)) 
## # A tibble: 336,776 x 3
##     gain speed gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1    -9  370.         -2.38
##  2   -16  374.         -4.23
##  3   -31  408.        -11.6 
##  4    17  517.          5.57
##  5    19  394.          9.83
##  6   -16  288.         -6.4 
##  7   -24  404.         -9.11
##  8    11  259.         12.5 
##  9     5  405.          2.14
## 10   -10  319.         -4.35
## # … with 336,766 more rows

2.8 Group By and Summarise

group_by() and summarise() often go together to get summary statistics for each group: group_by() reorganizes the rows of the dataframe according to the grouping column(s), and summarise() then computes statistics for each group. Here, the origin column has three categories: EWR, JFK, and LGA. The group_by(origin) command organizes the dataframe by the three origins, and summarise() is then used to get metrics for each origin.

From this table, we can see that EWR had the most flights with 120,835, and LGA had the lowest average delay at about 10.3 minutes.

flights %>%
  group_by(origin) %>%
  summarise(num_of_flights = n(), 
            avg_delay      = mean(dep_delay, na.rm = TRUE)) # na.rm = TRUE ignores missing values
## # A tibble: 3 x 3
##   origin num_of_flights avg_delay
##   <chr>           <int>     <dbl>
## 1 EWR            120835      15.1
## 2 JFK            111279      12.1
## 3 LGA            104662      10.3

group_by can also take expressions. In the following, the TRUE group counts flights that departed late AND arrived late; the FALSE group counts the remaining flights, i.e., those that departed on time or arrived on time (or both).

flights %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  group_by(dep_arr_late = dep_delay > 0 & arr_delay > 0) %>%
  summarise(num_of_flights = n())
## # A tibble: 2 x 2
##   dep_arr_late num_of_flights
##   <lgl>                 <int>
## 1 FALSE                235043
## 2 TRUE                  92303

summarise() works with a number of other functions, for example the commonly-used mean(), sum(), max(), and min(). Sometimes we are interested in the value at a specific position in each group: first() and last() retrieve the first and last values respectively, and nth(x, n) retrieves the nth value. You can combine arrange() and nth() to get the n-th largest value.

flights %>%
  distinct(origin, dest, distance) %>%
  group_by(origin) %>%
  arrange(origin, -distance) %>%
  summarise(third_farthest = nth(distance, 3))
## # A tibble: 3 x 2
##   origin third_farthest
##   <chr>           <dbl>
## 1 EWR              2565
## 2 JFK              2576
## 3 LGA              1416

n_distinct(dest) returns the number of distinct destinations. From this table we can see that EWR has flights to the largest number of destinations (86). We can also see LGA flights have a lower average distance than those of EWR & JFK.

flights %>%
  group_by(origin) %>%
  summarise(destinations = n_distinct(dest), 
            avg_distance = mean(distance, na.rm = TRUE))
## # A tibble: 3 x 3
##   origin destinations avg_distance
##   <chr>         <int>        <dbl>
## 1 EWR              86        1057.
## 2 JFK              70        1266.
## 3 LGA              68         780.

group_by also takes multiple variables. For example, say we want to know what the farthest flight to leave NYC is. To answer this, we can group by origin and destination, summarise the max distance for each pair, and then order by the maximum distance value we created. It is now easy to see that the max distance was from EWR or JFK to HNL.

flights %>%
  group_by(origin, dest) %>%
  summarise(max_distance = max(distance)) %>%
  arrange(desc(max_distance))
## `summarise()` has grouped output by 'origin'. You can override using the `.groups` argument.
## # A tibble: 224 x 3
## # Groups:   origin [3]
##    origin dest  max_distance
##    <chr>  <chr>        <dbl>
##  1 JFK    HNL           4983
##  2 EWR    HNL           4963
##  3 EWR    ANC           3370
##  4 JFK    SFO           2586
##  5 JFK    OAK           2576
##  6 JFK    SJC           2569
##  7 EWR    SFO           2565
##  8 JFK    SMF           2521
##  9 JFK    LAX           2475
## 10 JFK    BUR           2465
## # … with 214 more rows

In addition to summarise(), group_by() can be useful for other purposes. For example, getting the summary statistics of arr_delay by carrier:

flights %>%
  # filter(carrier %in% c("AA", "UA")) %>%
  group_by(carrier) %>%
  skim(arr_delay)

Another example is to find the two worst arrival delays per month for AA and UA.

flights %>%
  filter(carrier %in% c("AA", "UA")) %>%
  group_by(carrier, month) %>%
  filter(rank(-arr_delay) < 3) %>%
  arrange(carrier, month, -arr_delay)
## # A tibble: 48 x 19
## # Groups:   carrier, month [24]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     2     1607           1030       337     2003           1355
##  2  2013     1     1     2205           1720       285       46           2040
##  3  2013     2    11     1416            810       366     1845           1315
##  4  2013     2    14     1447            920       327     1735           1240
##  5  2013     3     8     1933           1345       348     2231           1705
##  6  2013     3     1     1528            920       368     1738           1240
##  7  2013     4    19      606           1725       761      923           2020
##  8  2013     4    19      617           1700       797      858           1955
##  9  2013     5    19      713           1700       853     1007           1955
## 10  2013     5    16     2233           1340       533       59           1640
## # … with 38 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.9 Join

When you have two tables, you might want to merge them together. For example, the airlines data is a table of carrier codes and their corresponding full names.

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

Now we want to get the total number of flights of each carrier with its full name attached. First, we use group_by() and summarise() to get the total number of flights for each carrier. Then, we use a join function to join the result table with airlines. The by argument tells the join function which variable is the key to join on. In our example, we want to join the two tables by the carrier variable, so carrier is our key.

flights %>%
  group_by(carrier) %>%
  summarise(num_of_flights = n()) %>%
  left_join(airlines, by = "carrier") %>%
  arrange(-num_of_flights)
## # A tibble: 16 x 3
##    carrier num_of_flights name                       
##    <chr>            <int> <chr>                      
##  1 UA               58665 United Air Lines Inc.      
##  2 B6               54635 JetBlue Airways            
##  3 EV               54173 ExpressJet Airlines Inc.   
##  4 DL               48110 Delta Air Lines Inc.       
##  5 AA               32729 American Airlines Inc.     
##  6 MQ               26397 Envoy Air                  
##  7 US               20536 US Airways Inc.            
##  8 9E               18460 Endeavor Air Inc.          
##  9 WN               12275 Southwest Airlines Co.     
## 10 VX                5162 Virgin America             
## 11 FL                3260 AirTran Airways Corporation
## 12 AS                 714 Alaska Airlines Inc.       
## 13 F9                 685 Frontier Airlines Inc.     
## 14 YV                 601 Mesa Airlines Inc.         
## 15 HA                 342 Hawaiian Airlines Inc.     
## 16 OO                  32 SkyWest Airlines Inc.

There are different types of joins: left_join(), right_join(), inner_join(), and full_join().

  • left_join(): keeps all observations in x, which is flights in our example. If there is no corresponding key value in the y table, the joined table will have NA in y’s columns.
  • right_join(): keeps all observations in y, which is airlines in our example.
  • inner_join(): keeps matched pairs only and drops rows that do NOT have keys in both tables.
  • full_join(): keeps all observations in x and y.

Here’s a Venn diagram to depict them.

To compare the different joins, let’s create a lookup table that includes United Airlines, American Airlines, and Lufthansa. We compare the join results using inner_join(), left_join() and right_join().

my_airlines <- data.frame(abbr = c("AA", "LH", "UA"),
                          name = c("American Airline", "Lufthansa", "United Airline"))

Since only AA and UA are in both tables, the inner-joined table will keep only AA and UA and drop LH.

# inner_join
flights %>%
  group_by(carrier) %>%
  summarise(num_of_flights = n()) %>%
  inner_join(my_airlines, by = c("carrier" = "abbr")) %>%
  arrange(-num_of_flights)
## # A tibble: 2 x 3
##   carrier num_of_flights name            
##   <chr>            <int> <chr>           
## 1 UA               58665 United Airline  
## 2 AA               32729 American Airline

The left-joined table keeps all observations in x (the summarised flights table), so carriers that are not in my_airlines get an NA name.

# left_join
flights %>%
  group_by(carrier) %>%
  summarise(num_of_flights = n()) %>%
  left_join(my_airlines, by = c("carrier" = "abbr")) %>%
  arrange(-num_of_flights)
## # A tibble: 16 x 3
##    carrier num_of_flights name            
##    <chr>            <int> <chr>           
##  1 UA               58665 United Airline  
##  2 B6               54635 <NA>            
##  3 EV               54173 <NA>            
##  4 DL               48110 <NA>            
##  5 AA               32729 American Airline
##  6 MQ               26397 <NA>            
##  7 US               20536 <NA>            
##  8 9E               18460 <NA>            
##  9 WN               12275 <NA>            
## 10 VX                5162 <NA>            
## 11 FL                3260 <NA>            
## 12 AS                 714 <NA>            
## 13 F9                 685 <NA>            
## 14 YV                 601 <NA>            
## 15 HA                 342 <NA>            
## 16 OO                  32 <NA>

The right-joined table keeps all observations in y (my_airlines), so LH has an NA flight count.

flights %>%
  group_by(carrier) %>%
  summarise(num_of_flights = n()) %>%
  right_join(my_airlines, by = c("carrier" = "abbr")) %>%
  arrange(-num_of_flights)
## # A tibble: 3 x 3
##   carrier num_of_flights name            
##   <chr>            <int> <chr>           
## 1 UA               58665 United Airline  
## 2 AA               32729 American Airline
## 3 LH                  NA Lufthansa

For more on joining tables, see here.
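
The list above describes full_join(), but we have not seen it in action. Here is a minimal sketch on two toy tables (the data frames x and y below are made up for illustration): unmatched rows from both sides are kept, with NA filling the missing columns.

```r
library(dplyr)

# Hypothetical toy tables: x and y share AA and UA; B6 is only in x, LH only in y
x <- data.frame(carrier = c("AA", "UA", "B6"),
                num_of_flights = c(3, 5, 2))
y <- data.frame(carrier = c("AA", "UA", "LH"),
                name = c("American", "United", "Lufthansa"))

# Keeps all four carriers: B6 gets NA for name, LH gets NA for num_of_flights
full_join(x, y, by = "carrier")
```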

2.10 Question

Get the cumulative number of flights of each airline by month. Hint: use the cumulative sum function cumsum().
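
One possible approach, sketched on a tiny made-up stand-in for flights (the toy data frame below is hypothetical; the same pipeline applies to flights): count flights per carrier and month, then take cumsum() within each carrier.

```r
library(dplyr)

# Hypothetical mini flights table: one row per flight
toy <- data.frame(
  carrier = c("AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  month   = c(1,    1,    2,    1,    2,    2,    3)
)

monthly_cum <- toy %>%
  group_by(carrier, month) %>%
  summarise(n_flights = n(), .groups = "drop") %>%  # flights per carrier-month
  group_by(carrier) %>%
  arrange(month, .by_group = TRUE) %>%              # months in order within carrier
  mutate(cum_flights = cumsum(n_flights)) %>%       # running total within carrier
  ungroup()
monthly_cum
```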

3 Using ggplot

ggplot is THE tool for exploratory data analysis. Exploratory data analysis is a way to develop an understanding of your data, and we cannot stress enough how important understanding your data is for data scientists. ggplot is one of the best plotting tools among comparable software/libraries (Matlab, Python’s matplotlib). Our goal here is to let the graphs tell the story using ggplot.

gg stands for the grammar of graphics. The basic idea of ggplot is to independently specify building blocks and combine them to create just about any kind of graphical display you want. The building blocks of a graph include:

  • data
  • aesthetic mapping (x, y axes; group; color)
  • geometric object (line, bar)
  • faceting

3.1 Aesthetic Mappings

In ggplot land aesthetic means “something you can see”. Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • size

Gapminder is a non-profit organization whose goal is to give a fact-based view of the world. The package gapminder contains a dataset, also named gapminder, of world life expectancy statistics. Let’s do a quick summary.

summary(gapminder)
# skim(gapminder)

3.2 Plots by Data Types

Data                                             Plot                   Geom (ggplot command)
One continuous                                   Histogram              geom_histogram
One continuous + one categorical                 Boxplot                geom_boxplot
Two continuous                                   Scatter plot           geom_point
Three continuous                                 Scatter plot + size    geom_point w/ size aesthetic
Two continuous + one categorical                 Scatter plot + color   geom_point w/ color aesthetic
Categorical with a reasonable number of levels   Faceting!!             facet_wrap()

Note: when time is one of the variables, it always goes on the x-axis.

There are many more geom types, but we will focus on the ones listed in the table above.

Here is an extremely useful cheatsheet that shows all of ggplot’s functions and how to use them.

3.3 One Continuous / Geom_Histogram

The following shows the histogram of life expectancy in 2007. Life expectancy is a continuous variable, so we use geom_histogram().

Note how the %>% or “piping” also works with ggplot. If you are not piping in a dataframe, the first input to ggplot should be your dataframe. For example, the command would become ggplot(gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 2).

hist(gapminder$lifeExp)

gapminder %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2)

What catches your eye right away?

Life expectancy in other years:

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
gapminder %>% filter(year == 2005) %>%  # 2005 -> no histogram, why not? 
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2)

# see what year we have
unique(gapminder$year)
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
gapminder %>% 
  filter(year == 1952) %>%  # change to other years
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2)

gapminder %>% 
  filter(year == 2007) %>%  # change to other years
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2)

# put two together using facet_wrap()
gapminder %>% 
  filter(year %in% c(1952, 2007)) %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2) +
  facet_wrap(~year)

3.4 One Continuous + One Categorical / Geom_boxplot

Now, we want to show lifeExp broken down by continent. Continent is a categorical variable, called a factor in R. For this, we use the geom_boxplot() command.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = continent, y = lifeExp)) + 
  geom_boxplot()

3.5 Two Continuous / Geom_Point

Using geom_point() we create a scatter plot of our two continuous variables, gdpPercap and lifeExp.

plot(gapminder$gdpPercap, gapminder$lifeExp, pch=16)

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()

Some relationships will look better on different scales, and ggplot allows you to change scales very quickly. Here we log the x-axis with scale_x_log10(), which makes the relationship between these two variables much clearer. Use the size and shape arguments in geom_point() to adjust the size and shape of the points. The shape options are indexed as follows.

Credit to https://blog.albertkuo.me/post/point-shape-options-in-ggplot/

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(size = 1, shape = 5) +
  scale_x_log10()

3.6 Three Continuous / Geom_point With Size Aesthetic

If we want to show three continuous variables at the same time, we can use the size aesthetic in ggplot. This will alter the size of the point by the value in the pop column of the gapminder data frame.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop)) + 
  geom_point() +
  scale_x_log10()

3.7 Two Continuous + One Categorical / Geom_point With Color Aesthetic

To pack more insight into this graph, we can distinguish each point by the continent it comes from. Adding the color aesthetic lets us show a categorical variable, continent: each point is colored by the continent it belongs to.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10()

3.8 Linear model line quickly / Geom_smooth

ggplot can also quickly add a linear model to a graph. geom_smooth() supports other models as well (“lm”, “glm”, “gam”, “loess”, “rlm”). If you leave the method blank, it will automatically choose one for you, but that is not recommended.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(se = TRUE)

To add the linear model line, we add geom_smooth(method = 'lm', se = TRUE) to the command. se = TRUE tells it to plot the standard-error band around the fitted line.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point(aes(col = continent)) + 
  geom_smooth(method = 'lm', formula = 'y~x', se = TRUE)

Note that aes() inside ggplot() is a global setting: it affects all geometric objects. It is best practice to put the aesthetic settings in each geometric object instead. See the difference in the next plot.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp, col = continent)) + 
  geom_point() + 
  geom_smooth(method = 'lm', formula = 'y~x', se = TRUE)
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf

3.9 Trend

So far we have been making cross-sectional plots. We can use geom_line() with year on the x-axis and life expectancy on the y-axis. Here we plot the life expectancy trend of the US.

gapminder %>%
  filter(country == "United States") %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line()

Now we want to plot the life expectancy trend of each country, a so-called spaghetti plot. We need to tell geom_line() to group the lines by country, and we use different colors to separate continents.

gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, col = continent))

It is a bit hard to see the pattern, so we add the mean trend of each continent to the plot. This requires a second geom_line() call with its own summarized data. To make the mean trends stand out, we set alpha to make the spaghetti lines transparent.

gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, col = continent), alpha = .2) +
  geom_line(data = gapminder %>% 
              group_by(continent, year) %>% 
              summarise(mean = mean(lifeExp)),
            aes(x = year, y = mean, group = continent, col = continent), 
            size = 1)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

Another kind of trend plot shows how each group changes over time. We can use a bar plot with fill mapped to group. There are three position options for geom_bar():

  • stack: stack the groups on top of each other
  • dodge: place the groups side by side
  • fill: show the proportion of each group
gapminder %>%
  ggplot(aes(x = year, y = pop, fill = continent)) + 
  geom_bar(position = "stack", stat = "identity") 

gapminder %>%
  ggplot(aes(x = year, y = pop, fill = continent)) + 
  geom_bar(position = "fill", stat = "identity") 

3.10 Faceting

Instead of changing the color of points on the graph by continent, you can also create a different graph for each continent by ‘faceting’. Depending on the number of factors and your dataset, faceting may look better than just changing colors. To do this we add the facet_wrap(~ continent) command.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~continent)

You can facet with any geom type. For example, our spaghetti plot.

gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, col = continent), alpha = .2) +
  geom_line(data = gapminder %>% 
              group_by(continent, year) %>% 
              summarise(mean = mean(lifeExp)),
            aes(x = year, y = mean, group = continent, col = continent), 
            size = 1) +
  facet_wrap(~ continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

Another example: a linear model for each continent. We can use scales = "free" in facet_wrap() to free the scales of the x and y axes. This is generally NOT recommended, since it can be misleading.

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = 'lm', formula = 'y~x', se = TRUE)+
  facet_wrap(~ continent)
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf

3.11 Multiple plots: ggpubr::ggarrange()

To put multiple plots together, use ggarrange() from the ggpubr package.

p1 <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, col = continent), alpha = .2) +
  geom_line(data = gapminder %>% 
              group_by(continent, year) %>% 
              summarise(mean = mean(lifeExp)),
            aes(x = year, y = mean, group = continent, col = continent), 
            size = 1)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
p2 <- gapminder %>%
  ggplot(aes(x = year, y = pop, fill = continent)) + 
  geom_bar(position = "stack", stat = "identity") 
# put p1 and p2 together
p <- ggpubr::ggarrange(p1, p2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p,
               top = text_grob("Life expectancy and Population", 
                               color = "black", face = "bold", size = 14))

3.12 Saving plots ggsave()

Use ggsave() to save plots to png, jpeg or pdf. By default it saves the last plot displayed, unless we specify which plot to save via the plot argument.

ggsave(filename = "fig/life_exp_population.pdf", width = 8, height = 5)
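To save a specific plot object rather than whatever was drawn last, pass it via the plot argument. A minimal sketch (the object name p and the temporary file path are illustrative):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# save this particular plot, not the most recently displayed one
out_file <- tempfile(fileext = ".png")
ggsave(filename = out_file, plot = p, width = 8, height = 5)
```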

3.13 plotly package

Plotly is an interactive, web-based graphing library. It is well integrated with ggplot2 via ggplotly(): you can feed any ggplot into ggplotly() and it will output an interactive version. Note that since it is web-based, it only works when you knit to HTML.

p <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country, col = continent), alpha = .2) +
  geom_line(data = gapminder %>% 
              group_by(continent, year) %>% 
              summarise(mean = mean(lifeExp)),
            aes(x = year, y = mean, group = continent, col = continent), 
            size = 1)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
ggplotly(p)

3.13.1 Animation

One thing worth mentioning is animation with plotly, using the frame argument.

p <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, frame = year, ids = country)) +
  scale_x_log10()
## Warning: Ignoring unknown aesthetics: frame, ids
ggplotly(p)

3.14 Theming

ggplot makes it very easy to create nice graphs using what is called a theme. There are many default themes, or you can make your own. Almost everything in ggplot is customizable, including sizes, colors, font types, etc. Below is an example of building a theme.

To use a theme, simply add it at the end of your chain of ggplot commands. You can also add titles with ggtitle() and change axis labels with the xlab() and ylab() commands.
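As a sketch of how this composes (theme_minimal() here is just one of ggplot2's built-in themes; swap in any other):

```r
library(ggplot2)
library(gapminder)

p_themed <- ggplot(subset(gapminder, year == 2007),
                   aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  ggtitle("Life expectancy vs. GDP per capita, 2007") +  # plot title
  xlab("GDP per capita (log scale)") +                   # x-axis label
  ylab("Life expectancy") +                              # y-axis label
  theme_minimal()  # try theme_bw(), theme_classic(), or build your own theme()
p_themed
```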

3.14.1 Scale

We might want to change the labels or colors of the legend. The scale_* family of functions comes in handy here.

We can use scale_color_brewer() or scale_fill_brewer() with the palette argument, depending on whether you are using the col or fill aesthetic. Here are the available palettes:

  • Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral
  • Qualitative: Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3
  • Sequential: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd
gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point(aes(col = continent)) +
  scale_color_brewer(palette = "Spectral", 
                     labels = c("AFRICA", "AMERICA", "ASIA", "EUROPE", "OCEANIA")) + 
  labs(colour = "CONTINENT")

gapminder %>%
  ggplot(aes(x = year, y = pop, fill = continent)) + 
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_brewer(palette = "Spectral", direction = -1)

We can also set colors manually with scale_*_manual(), or use scale_*_gradient() for continuous variables.
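For instance, a sketch of both a manual scale and a gradient scale (the color choices are arbitrary):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# manual scale: map each factor level to a chosen color
p_manual <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(col = continent)) +
  scale_color_manual(values = c(Africa = "tomato", Americas = "gold",
                                Asia = "seagreen", Europe = "steelblue",
                                Oceania = "purple"))

# gradient scale: color a continuous variable along a two-color ramp
p_gradient <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(col = pop)) +
  scale_color_gradient(low = "lightblue", high = "darkblue")
```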

3.14.2 ggthemes

You can also use pre-existing themes. ggthemes recreates popular themes from websites and magazines. See its gallery here. Here is one based on The Economist magazine.

4 Using apply and purrr::map family

In the basic R tutorial, we mentioned that R is a functional programming language. Functional programming languages build programs by applying and composing functions, and they allow us to pass a function into another function.

For example, if we want the mean and sd of each column of a data frame, we can do it one column at a time, with a for loop, or by taking advantage of R being a functional programming language. (Remember that a data frame is a list of its columns!)

gapminder_sub <- gapminder %>% select(lifeExp, pop)
# get mean
gapminder_sub_mean <- rep(NA, ncol(gapminder_sub))
for(i in seq_along(gapminder_sub)) {
  gapminder_sub_mean[i] <- mean(gapminder_sub[[i]])
}
# get sd
gapminder_sub_sd <- rep(NA, ncol(gapminder_sub))
for(i in seq_along(gapminder_sub)) {
  gapminder_sub_sd[i] <- sd(gapminder_sub[[i]])
}

Then we might come up with more statistics to calculate for each column. Copying and pasting the for loop again makes your code redundant and hard to read. This practice is also error-prone and inflexible, and as your code gets more complicated, it becomes very hard to debug. What if we want to do the same thing to another data set?

R allows us to pass a function to a function. Instead of writing several functions that share the same structure and differ only in calling mean() or sd(), we can pass the function itself as an argument.

col_summary <- function(df, fun) {
  output <- rep(NA, ncol(df))
  for(i in seq_along(df)) {
    output[i] <- fun(df[[i]])
  }
  output
}
col_summary(gapminder_sub, mean)
## [1] 5.947444e+01 2.960121e+07

We can even pass our self-defined function into col_summary().

log_mean <- function(x) {
  mean(log(x))
}
col_summary(gapminder_sub, log_mean)
## [1]  4.059839 15.766110

Applying a function to each element of a list/vector is so common that the apply family of functions and the purrr package were designed specifically for this purpose. They are implemented in C, which makes them a little faster.

4.1 lapply() and map()

lapply() and map() take a vector or a list, apply a function to each element, and output a list. Additional function arguments (e.g. na.rm = T) follow the function name.

lapply(gapminder_sub, mean, na.rm = T)
## $lifeExp
## [1] 59.47444
## 
## $pop
## [1] 29601212
map(gapminder_sub, mean, na.rm = T)
## $lifeExp
## [1] 59.47444
## 
## $pop
## [1] 29601212

4.2 sapply()/vapply() and map_*()

sapply() and vapply() are wrappers around lapply() that simplify the output from a list to a vector/matrix. They basically do the same thing, but vapply() is considered the safer choice because you have to specify the type of the output. map_*() is a simple alternative to vapply() whose suffix specifies the return type.

map_*()      return
map_lgl()    logical vector
map_int()    integer vector
map_dbl()    double vector
map_chr()    character vector

For example, we want to check the class of each column.

sapply(gapminder, class)
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"
vapply(gapminder, class, character(1))
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"
map_chr(gapminder, class)
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"

vapply() and sapply() can output a matrix, while map_*() can only output a vector.

vapply(gapminder_sub, summary, numeric(6))
##          lifeExp        pop
## Min.    23.59900      60011
## 1st Qu. 48.19800    2793664
## Median  60.71250    7023596
## Mean    59.47444   29601212
## 3rd Qu. 70.84550   19585222
## Max.    82.60300 1318683096

4.3 Anonymous function

We can write anonymous functions directly inside apply- and map-family calls as well!

vapply(gapminder_sub, function(x) mean(log(x)), numeric(1))
##   lifeExp       pop 
##  4.059839 15.766110
map_dbl(gapminder_sub, function(x) mean(log(x)))
##   lifeExp       pop 
##  4.059839 15.766110

5 Using data.table

data.table is an enhanced version of data.frame: one can do a lot more than subsetting rows and selecting columns within the frame of a data.table, i.e., within [ ... ]. The general form of data.table syntax is shown below:

## DT[i, j, by]
## #   R:        i            j        by
## # SQL:  where   select | update  group by

The way to read it (out loud) is:

Take DT, subset rows using i, then calculate j, grouped by by.

Let’s redo the work by dplyr above in the data.table way. This tutorial will only cover the basics of data.table. There are more advanced data.table operations that come in handy. See the data.table cheat sheet here. Comparisons between dplyr, data.table and pandas (a popular Python package) in terms of speed can be found here.

5.1 fread() and fwrite()

fread() and fwrite() are similar to read.csv() and write.csv() for reading and writing data, but they are much faster, especially for BIG datasets.
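A minimal round trip, written to a temporary file (the data.table dt here is a toy example):

```r
library(data.table)

dt <- data.table(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")
fwrite(dt, path)        # much faster than write.csv(); no row names by default
dt_back <- fread(path)  # column types are detected automatically
```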

5.2 Create data.table

setDT() converts a data.frame into a data.table by reference. Notice that since data.table is an enhanced version of data.frame, operations on data.frames are still available on data.tables. We first make a copy of flights and convert it into a data.table.

flights2 <- copy(flights)
setDT(flights2)
class(flights2)
## [1] "data.table" "data.frame"

5.3 Subset rows in i

To show flights on Jan 1st, 2013:

dplyr:

flights %>% filter(month == 1 & day == 1)

data.table:

flights2[month == 1 & day == 1]

To get any flights from Jan through June to PHL or SLC airports:

dplyr:

flights %>% filter(dest %in% c("PHL","SLC") & month <= 6)

data.table:

flights2[dest %in% c("PHL","SLC") & month <= 6]

To get the first n rows:

dplyr:

flights %>% slice(1:2)

data.table:

flights2[1:2]

5.4 Sort rows in i

To sort by column origin in ascending order, and then by dest in descending order:

dplyr:

flights %>% 
  select(origin, dest, carrier) %>% 
  distinct() %>%
  arrange(origin, desc(dest), carrier)

data.table:

unique(flights2[order(origin, -dest), .(origin, dest, carrier)])

5.5 Select column(s) in j

To select the origin, dest, and carrier columns of flights:

dplyr:

flights %>% select(origin, dest, carrier)

data.table:

flights2[, list(origin, dest, carrier)]

or

flights2[, .(origin, dest, carrier)]

Or you can store the column names in a variable and select them using .SDcols together with .SD (Subset of Data). data.table also offers the special prefix .. for selecting columns named in a vector.

select_columns <- c("origin", "dest", "carrier")
flights2[, .SD, .SDcols = select_columns]
# flights2[, ..select_columns]

When selecting only one column, data.table can return the column as a vector, just as base $ does.

To select origin and return it as a vector:

dplyr:

flights$origin
flights %>% pull(origin)

data.table:

flights2[, origin]

To rename dest to destination and carrier to airline:

dplyr:

flights %>%
  select(origin, dest, carrier) %>% 
  rename(destination = dest, airline = carrier)

data.table:

flights3 <- copy(flights2)  # copy() so flights2 keeps its original column names
setnames(flights3, 
         old = c("dest", "carrier"), 
         new = c("destination", "airline"))

Notice that in dplyr, the syntax is rename(new_name = old_name), which can be read as “rename old_name to new_name”.

5.6 Compute in j

Other than selecting columns, j can also handle expressions. For example,

To compute how many trips had total delay < 0:

base:

sum((flights$arr_delay + flights$dep_delay) < 0, na.rm = T)
## [1] 188401

data.table:

flights2[, sum((arr_delay + dep_delay) < 0, na.rm = T)]
## [1] 188401

5.7 Special symbol .N

There are some special symbols in data.table, e.g. .N, .SD, .SDcols. .N holds the number of observations in the current group. Think of it as the counterpart of dplyr's n().

To count the total number of flights:

dplyr:

flights %>%
  summarise(num_of_flights = n())
## # A tibble: 1 x 1
##   num_of_flights
##            <int>
## 1         336776

data.table:

flights2[, .(num_of_flights = .N)]
##    num_of_flights
## 1:         336776

5.8 Special symbol :=

To create new columns, as mutate() of dplyr does, use the := operator:

dplyr:

flights %>%
  mutate(gain = dep_delay - arr_delay)

data.table:

flights2[, gain := dep_delay - arr_delay]

To remove a column such as gain, use := NULL.

flights2[, gain := NULL]
## Warning in `[.data.table`(flights2, , `:=`(gain, NULL)): Column 'gain' does not
## exist to remove

5.9 Grouping using by

Similar to group_by of dplyr, data.table applies the operations in j to groups specified in by.

To count the total number of flights from each origin and the average delay time of each:

dplyr:

flights %>%
  group_by(origin) %>%
  summarise(num_of_flights = n(),
            avg_delay      = mean(dep_delay, na.rm = TRUE))

data.table:

flights2[, .(num_of_flights = .N, 
             avg_delay      = mean(dep_delay, na.rm = TRUE)), 
         by = origin]

by also takes expressions like group_by does.

To count the number of flights that started late but arrived early (or on time), started and arrived late, etc.:

dplyr:

flights %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  group_by(dep_delay > 0, arr_delay > 0) %>%
  summarise(num_of_flights = n())

data.table:

flights2[!is.na(dep_delay) & !is.na(arr_delay), 
         .N, by = .(dep_delay>0, arr_delay>0)]

5.10 apply function

data.table supports lapply() in j to operate on columns. The following code gets the total number of NAs in each column. Remember that .SDcols and .SD (Subset of Data) are for selecting columns.

vapply(flights, function(x) sum(is.na(x)), numeric(1))
##           year          month            day       dep_time sched_dep_time 
##              0              0              0           8255              0 
##      dep_delay       arr_time sched_arr_time      arr_delay        carrier 
##           8255           8713              0           9430              0 
##         flight        tailnum         origin           dest       air_time 
##              0           2512              0              0           9430 
##       distance           hour         minute      time_hour 
##              0              0              0              0
flights2[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = names(flights)]
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1:    0     0   0     8255              0      8255     8713              0
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:      9430       0      0    2512      0    0     9430        0    0      0
##    time_hour
## 1:         0

5.11 Joining using merge()

To join two tables using data.table, use the merge() function. The key variable(s) are passed via the by argument if they share the same name in both tables, or via by.x and by.y to name the key variables of the x and y tables separately. The all.x, all.y and all arguments decide which join to perform:

  • left_join(): all.x=TRUE
  • right_join(): all.y=TRUE
  • inner_join(): default
  • full_join(): all=TRUE

dplyr:

flights %>%
  group_by(carrier) %>%
  summarise(num_of_flights = n()) %>%
  left_join(airlines, by = "carrier") %>%
  arrange(-num_of_flights)
## # A tibble: 16 x 3
##    carrier num_of_flights name                       
##    <chr>            <int> <chr>                      
##  1 UA               58665 United Air Lines Inc.      
##  2 B6               54635 JetBlue Airways            
##  3 EV               54173 ExpressJet Airlines Inc.   
##  4 DL               48110 Delta Air Lines Inc.       
##  5 AA               32729 American Airlines Inc.     
##  6 MQ               26397 Envoy Air                  
##  7 US               20536 US Airways Inc.            
##  8 9E               18460 Endeavor Air Inc.          
##  9 WN               12275 Southwest Airlines Co.     
## 10 VX                5162 Virgin America             
## 11 FL                3260 AirTran Airways Corporation
## 12 AS                 714 Alaska Airlines Inc.       
## 13 F9                 685 Frontier Airlines Inc.     
## 14 YV                 601 Mesa Airlines Inc.         
## 15 HA                 342 Hawaiian Airlines Inc.     
## 16 OO                  32 SkyWest Airlines Inc.

data.table:

num_of_flights_by_carrier <- merge(flights2[, .(num_of_flights = .N), by = "carrier"],
                                   airlines,
                                   by = "carrier",
                                   all.x = T)
num_of_flights_by_carrier[order(-num_of_flights)]
##     carrier num_of_flights                        name
##  1:      UA          58665       United Air Lines Inc.
##  2:      B6          54635             JetBlue Airways
##  3:      EV          54173    ExpressJet Airlines Inc.
##  4:      DL          48110        Delta Air Lines Inc.
##  5:      AA          32729      American Airlines Inc.
##  6:      MQ          26397                   Envoy Air
##  7:      US          20536             US Airways Inc.
##  8:      9E          18460           Endeavor Air Inc.
##  9:      WN          12275      Southwest Airlines Co.
## 10:      VX           5162              Virgin America
## 11:      FL           3260 AirTran Airways Corporation
## 12:      AS            714        Alaska Airlines Inc.
## 13:      F9            685      Frontier Airlines Inc.
## 14:      YV            601          Mesa Airlines Inc.
## 15:      HA            342      Hawaiian Airlines Inc.
## 16:      OO             32       SkyWest Airlines Inc.

5.12 Chaining

data.table can chain expressions one after another, forming a chain of operations analogous to piping with %>%, i.e., DT[ ... ][ ... ][ ... ].

flights2[, .(max_distance = max(distance)), by = .(origin, dest)
][order(-max_distance)]

6 dplyr or data.table?

Choosing dplyr or data.table is a matter of personal preference. Here is a Stack Overflow post concerning this question, in which the authors of dplyr (Hadley) and data.table (Arun) compare their packages in terms of speed, memory usage and syntax. In short, dplyr is more readable (though readability is subjective), while data.table performs faster than dplyr as the data size grows and uses less memory in several functions by its nature. A benchmark comparison between dplyr, data.table and Python's pandas is also available here, and a more comprehensive comparison here.

If you really want the pros of both, dtplyr is a solution: it allows you to write dplyr code and translates it to data.table.
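A small sketch of the idea, assuming the dtplyr and dplyr packages are installed (mtcars stands in for any data frame): lazy_dt() wraps the data, dplyr verbs build up a translation, and show_query() reveals the generated data.table code.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

lazy_cars <- lazy_dt(mtcars)        # wrap a data frame lazily
result <- lazy_cars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))
show_query(result)                  # print the translated data.table call
as_tibble(result)                   # force evaluation and collect the result
```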

7 Running RStudio on Server/Cloud Services

Lastly, we introduce a way to run RStudio in the cloud, based on RStudio Server. This can come in handy for more complex analyses with larger datasets, as it can greatly speed up processing times.

There are many existing guides to help you set up an RStudio instance on the cloud with Google or Amazon Web Services.

Here are a few good guides for using Google Cloud Services:

and Amazon Web Services: